2023-07-14

Formula 1

Formula 1 is the highest class of international racing for open-wheel single-seater formula racing cars. The FIA Formula 1 World Championship has been one of the premier forms of racing around the world since 1950.

The word “formula” refers to the set of rules to which all participants’ cars must conform. A season consists of a series of races, known as Grands Prix (GP), held in multiple countries and continents around the world.

Contents

  1. Problem and Objective
  2. Data preparation
  3. Different types of regression
  4. Findings and conclusion

Questions

Is it possible to predict the total number of points with which each driver will finish the Formula 1 championship?

Dataset

Ergast API

The Ergast Developer API is an experimental web service which provides a historical record of motor racing data for non-commercial purposes. The API provides data for the Formula One series, from the beginning of the world championships in 1950 to today. The database tables can also be downloaded in CSV format or as an SQL image.

Each CSV file contains a single database table and the first line of each file contains the column headers. The tables are described in the User Guide.
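Because the first line of each file holds the column headers, a table can be read straight into records keyed by column name. The following is a minimal sketch in plain Python, not the code used for this report. It assumes a comma-separated layout, and the inline text stands in for a downloaded file, holding only a few columns from the first rows of the results sample. `read_table` is a hypothetical helper name.

```python
import csv
import io

# Inline stand-in for a downloaded CSV file (assumption: comma-separated,
# header on the first line). Values are copied from the results sample.
RESULTS_CSV = """raceId,driverId,constructorId,grid,positionOrder,points
18,1,1,1,1,10
18,2,2,5,2,8
18,3,3,7,3,6
"""

def read_table(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

results = read_table(RESULTS_CSV)

# Every value arrives as a string, so numeric columns need an explicit cast.
total_points = sum(float(row["points"]) for row in results)
print(results[0]["driverId"], total_points)
```

For a real file, the same reader can be fed an open file handle instead of the in-memory string.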

Results

raceId  driverId  constructorId  grid  positionOrder  points  laps  fastestLapTime  fastestLapSpeed  statusId  status
18      1         1              1     1              10      58    1:27.452        218.300          1         Finished
18      2         2              5     2              8       58    1:27.739        217.586          1         Finished
18      3         3              7     3              6       58    1:28.090        216.719          1         Finished
18      4         4              11    4              5       58    1:28.603        215.464          1         Finished
18      5         1              3     5              4       58    1:27.418        218.385          1         Finished
18      6         3              13    6              3       57    1:29.639        212.974          11        +1 Lap

Drivers

raceId  driverId  points  position  wins  dob         driverName
18      1         10      1         1     1985-01-07  Lewis Hamilton
18      2         8       2         0     1977-05-10  Nick Heidfeld
18      3         6       3         0     1985-06-27  Nico Rosberg
18      4         5       4         0     1981-07-29  Fernando Alonso
18      5         4       5         0     1981-10-19  Heikki Kovalainen
18      6         3       6         0     1985-01-11  Kazuki Nakajima

Constructors

raceId  constructorId  points  position  wins  name
18      1              14      1         1     McLaren
18      2              8       3         0     BMW Sauber
18      3              9       2         0     Williams
18      4              5       4         0     Renault
18      5              2       5         0     Toro Rosso
18      6              1       6         0     Ferrari

Races and circuits

raceId  year  round  date        circuitId  name
1       2009  1      2009-03-29  1          Australian Grand Prix
2       2009  2      2009-04-05  2          Malaysian Grand Prix
3       2009  3      2009-04-19  17         Chinese Grand Prix
4       2009  4      2009-04-26  3          Bahrain Grand Prix
5       2009  5      2009-05-10  4          Spanish Grand Prix
6       2009  6      2009-05-24  6          Monaco Grand Prix

Merging information into a single dataframe

raceId  year  date        circuitId  gpName                 driverId  driverName         dob         constructorId  constructorName  grid  positionOrder  points  laps  fastestLapTime  fastestLapSpeed  statusId  status    round
18      2008  2008-03-16  1          Australian Grand Prix  1         Lewis Hamilton     1985-01-07  1              McLaren          1     1              10      58    1:27.452        218.300          1         Finished  1
18      2008  2008-03-16  1          Australian Grand Prix  2         Nick Heidfeld      1977-05-10  2              BMW Sauber       5     2              8       58    1:27.739        217.586          1         Finished  1
18      2008  2008-03-16  1          Australian Grand Prix  3         Nico Rosberg       1985-06-27  3              Williams         7     3              6       58    1:28.090        216.719          1         Finished  1
18      2008  2008-03-16  1          Australian Grand Prix  4         Fernando Alonso    1981-07-29  4              Renault          11    4              5       58    1:28.603        215.464          1         Finished  1
18      2008  2008-03-16  1          Australian Grand Prix  5         Heikki Kovalainen  1981-10-19  1              McLaren          3     5              4       58    1:27.418        218.385          1         Finished  1
18      2008  2008-03-16  1          Australian Grand Prix  6         Kazuki Nakajima    1985-01-11  3              Williams         13    6              3       57    1:29.639        212.974          11        +1 Lap    1
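The merge can be pictured as a series of lookups on the id columns: each result row picks up its driver, constructor and race details. Here is a minimal sketch in plain Python, assuming every id in the results has a match in the other tables. The small dictionaries are stand-ins built from the samples above, and `merge_tables` is a hypothetical helper, not the function used for this report. A dataframe library would express the same thing as joins on `driverId`, `constructorId` and `raceId`.

```python
# Small stand-ins for the source tables; the values come from the samples above.
results = [
    {"raceId": 18, "driverId": 1, "constructorId": 1, "grid": 1, "positionOrder": 1, "points": 10},
    {"raceId": 18, "driverId": 5, "constructorId": 1, "grid": 3, "positionOrder": 5, "points": 4},
]
drivers = {
    1: {"driverName": "Lewis Hamilton", "dob": "1985-01-07"},
    5: {"driverName": "Heikki Kovalainen", "dob": "1981-10-19"},
}
constructors = {1: {"constructorName": "McLaren"}}
races = {18: {"year": 2008, "round": 1, "date": "2008-03-16", "gpName": "Australian Grand Prix"}}

def merge_tables(results, drivers, constructors, races):
    """Attach driver, constructor and race details to every result row."""
    merged = []
    for row in results:
        record = dict(row)
        # A missing id raises KeyError, so incomplete lookups fail loudly.
        record.update(drivers[row["driverId"]])
        record.update(constructors[row["constructorId"]])
        record.update(races[row["raceId"]])
        merged.append(record)
    return merged

merged = merge_tables(results, drivers, constructors, races)
print(merged[1]["driverName"], merged[1]["constructorName"], merged[1]["year"])
```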

Calculate new features to add to the dataframe

  • winRate: percentage of races each driver has won
  • qualiRate: percentage of races each driver has started after qualifying first

driverName          winRate   qualiRate
Lee Wallard         50.00000  0.00000
Juan Fangio         41.37931  50.00000
Bill Vukovich       40.00000  20.00000
Alberto Ascari      36.11111  38.88889
Jim Clark           34.24658  46.57534
Lewis Hamilton      32.28840  32.28840
Michael Schumacher  29.54545  22.07792
Jackie Stewart      27.00000  17.00000
Ayrton Senna        25.30864  40.12346
Alain Prost         25.24752  16.33663
Max Verstappen      24.41860  15.11628
Stirling Moss       21.91781  23.28767
Bob Sweikert        20.00000  0.00000
Damon Hill          18.03279  16.39344
Sebastian Vettel    17.66667  19.00000
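Both rates are simple proportions. The sketch below assumes that a win means `positionOrder` equal to 1, that qualifying first means `grid` equal to 1, and that the denominator is the number of races the driver entered. The text does not spell out these definitions, so treat them as assumptions. The rows are made-up data for two placeholder drivers, and `driver_rates` is a hypothetical helper.

```python
from collections import defaultdict

def driver_rates(rows):
    """Return {driverName: (winRate, qualiRate)} as percentages of races entered."""
    races = defaultdict(int)
    wins = defaultdict(int)
    poles = defaultdict(int)
    for row in rows:
        name = row["driverName"]
        races[name] += 1
        if row["positionOrder"] == 1:  # assumption: finishing first counts as a win
            wins[name] += 1
        if row["grid"] == 1:  # assumption: starting first counts as qualifying first
            poles[name] += 1
    return {
        name: (100 * wins[name] / races[name], 100 * poles[name] / races[name])
        for name in races
    }

# Made-up rows for two placeholder drivers.
rows = [
    {"driverName": "Driver A", "grid": 1, "positionOrder": 1},
    {"driverName": "Driver A", "grid": 1, "positionOrder": 3},
    {"driverName": "Driver A", "grid": 4, "positionOrder": 2},
    {"driverName": "Driver A", "grid": 2, "positionOrder": 6},
    {"driverName": "Driver B", "grid": 7, "positionOrder": 9},
    {"driverName": "Driver B", "grid": 5, "positionOrder": 4},
]
rates = driver_rates(rows)
print(rates)
```

As a rough check on these assumed definitions, 103 wins in 319 races gives 32.2884, which agrees with the winRate listed for Lewis Hamilton.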

Heatmap

Analysis

Classify races into the first and second half of the championship
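One way to make the split is to compare each race's round with the number of rounds in its season. The sketch below assumes that a race belongs to the first half when its round is at most half of the season's last round. The exact rule is not stated here, so this cutoff is an assumption. The season is a made-up five-round calendar, and `add_half` is a hypothetical helper.

```python
def add_half(races):
    """Label each race 'first' or 'second' based on its round within the season."""
    last_round = {}
    for race in races:
        year = race["year"]
        last_round[year] = max(last_round.get(year, 0), race["round"])

    labelled = []
    for race in races:
        # Assumption: with an odd number of rounds, the middle race
        # falls into the second half.
        cutoff = last_round[race["year"]] / 2
        half = "first" if race["round"] <= cutoff else "second"
        labelled.append({**race, "half": half})
    return labelled

# Made-up season with five rounds.
races = [{"year": 2030, "round": r} for r in range(1, 6)]
labelled = add_half(races)
print([race["half"] for race in labelled])
```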

Why are points concentrated near 0?

To find out, let’s look at:

  • the number of points awarded for 1st place over the years
  • the number of races on the calendar over the years
  • the current point system for every race

Number of points awarded for 1st place over the years

Number of races run for each championship

The current point system introduced by the FIA

position  points
1         25
2         18
3         15
4         12
5         10
6         8
7         6
8         4
9         2
10        1

In addition, 1 point is awarded to the driver who sets the fastest lap of the race.
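The table and the fastest-lap bonus translate directly into a lookup. This is a minimal sketch, assuming that finishers outside the top ten score nothing and that the bonus is added without any further condition, as described above. `race_points` is a hypothetical helper name.

```python
# Points per finishing position, copied from the table above.
POINTS = {1: 25, 2: 18, 3: 15, 4: 12, 5: 10, 6: 8, 7: 6, 8: 4, 9: 2, 10: 1}

def race_points(position, fastest_lap=False):
    """Points for one race under the current system."""
    points = POINTS.get(position, 0)  # assumption: positions beyond 10th score 0
    if fastest_lap:
        points += 1  # bonus applied exactly as stated in the text
    return points

print(race_points(1, fastest_lap=True), race_points(4), race_points(12))
```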

Training Dataframe

Steps:

  • recalculate the points using the latest point system introduced by the FIA
  • calculate the average points scored by each driver
  • copy the qualifying and win rates to the training dataframe

Training dataframe

year  driverId  races  qualiRate  winRate   driverName        firstHalfPoints  allPoints  new_firstHalfPoints  new_allPoints  averagePoints
2022  830       172    15.11628   24.41860  Max Verstappen    192              433.0      189                  428            247.9375
2018  1         319    32.28840   32.28840  Lewis Hamilton    188              408.0      188                  408            267.5000
2019  1         319    32.28840   32.28840  Lewis Hamilton    225              413.0      223                  407            267.5000
2013  20        300    19.00000   17.66667  Sebastian Vettel  172              397.0      172                  397            195.4000
2021  830       172    15.11628   24.41860  Max Verstappen    184              388.5      181                  396            247.9375
2011  20        300    19.00000   17.66667  Sebastian Vettel  216              392.0      216                  392            195.4000
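The four points columns can be obtained by summing race results per season and driver, once with the points actually awarded and once re-scored with the current table. The sketch below assumes that each result row already carries a `half` label and that the re-scoring uses finishing position only. Both are assumptions about the pipeline, not a description of it. The input rows are made up, and `build_training_rows` is a hypothetical helper.

```python
from collections import defaultdict

CURRENT_POINTS = {1: 25, 2: 18, 3: 15, 4: 12, 5: 10, 6: 8, 7: 6, 8: 4, 9: 2, 10: 1}

def build_training_rows(results):
    """Aggregate race results into one row per (year, driverId)."""
    rows = defaultdict(lambda: {
        "firstHalfPoints": 0, "allPoints": 0,
        "new_firstHalfPoints": 0, "new_allPoints": 0,
    })
    for result in results:
        row = rows[(result["year"], result["driverId"])]
        new_points = CURRENT_POINTS.get(result["positionOrder"], 0)
        row["allPoints"] += result["points"]
        row["new_allPoints"] += new_points
        if result["half"] == "first":
            row["firstHalfPoints"] += result["points"]
            row["new_firstHalfPoints"] += new_points
    return dict(rows)

# Made-up two-race season scored under an older 10-8-6 system.
results = [
    {"year": 2008, "driverId": 1, "positionOrder": 1, "points": 10, "half": "first"},
    {"year": 2008, "driverId": 1, "positionOrder": 3, "points": 6, "half": "second"},
]
training = build_training_rows(results)
print(training[(2008, 1)])
```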

Plot of the new points side by side with the old ones

Heatmap of the training data

Linear Regression

Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables. The relationship is modelled using linear predictor functions whose unknown parameters are estimated from the data.
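For a single explanatory variable, the fit and its R-squared can be written out by hand. The sketch below uses ordinary least squares on the six training rows shown earlier, with first-half points as the predictor and full-season points as the response. Those rows are only an excerpt, so the resulting score is not expected to match the figures reported for the full data. `fit_line` and `r_squared` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# firstHalfPoints and allPoints from the six training rows shown earlier.
first_half = [192, 188, 225, 172, 184, 216]
full_season = [433.0, 408.0, 413.0, 397.0, 388.5, 392.0]

intercept, slope = fit_line(first_half, full_season)
print(intercept, slope, r_squared(first_half, full_season, intercept, slope))
```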

with just the first-half points

R-squared
Original Points  New Points
0.9662199        0.9288745

with more features

R-squared
Original Points  New Points
0.9679754        0.9350976

Polynomial Regression

Polynomial regression is a form of regression analysis in which the relationship between the independent variable and the dependent variable is modeled as an nth-degree polynomial in x. It fits a nonlinear relationship between the value of x and the corresponding conditional mean of y.
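Although the fitted curve is nonlinear in x, the model is still linear in its coefficients, so it can be estimated with the same least-squares machinery applied to the powers of x. Here is a minimal sketch in plain Python that solves the normal equations directly. The data points are made up, and `solve`, `fit_polynomial` and `predict` are hypothetical helpers. For real point totals in the hundreds, the features should be rescaled first, because the normal equations become badly conditioned.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_polynomial(xs, ys, degree=2):
    """Least-squares coefficients [c0, c1, ..., c_degree] via the normal equations."""
    size = degree + 1
    powers = [[x ** k for k in range(size)] for x in xs]
    xtx = [[sum(p[i] * p[j] for p in powers) for j in range(size)] for i in range(size)]
    xty = [sum(p[i] * y for p, y in zip(powers, ys)) for i in range(size)]
    return solve(xtx, xty)

def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

# Made-up points lying exactly on y = 1 + x^2.
xs = [0, 1, 2, 3]
ys = [1, 2, 5, 10]
coeffs = fit_polynomial(xs, ys, degree=2)
print(coeffs, predict(coeffs, 4))
```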

with one feature

R-squared
Original Points  New Points
0.9666189        0.9289899

with more features

R-squared
Original Points  New Points
0.8625396        0.8817937

Bayesian linear regression

With Bayesian regression, we formulate linear regression using probability distributions rather than point estimates. The response is not estimated as a single value but is assumed to be drawn from a probability distribution.
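To make the idea concrete, the sketch below works through the textbook conjugate case for one feature: a zero-mean Gaussian prior on the intercept and slope, and Gaussian noise with known precision. The prior precision `alpha`, the noise precision `beta` and the data are all illustrative assumptions, and this is not necessarily the model fitted for this report. The result is a distribution over the coefficients, which yields a mean and a variance for every prediction.

```python
def posterior(xs, ys, alpha=1e-6, beta=1.0):
    """Posterior mean and covariance of (intercept, slope).

    Assumes a zero-mean Gaussian prior with precision alpha on both weights
    and Gaussian noise with known precision beta.
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Posterior precision matrix alpha * I + beta * X^T X, a symmetric 2x2 matrix.
    a = alpha + beta * n
    b = beta * sx
    d = alpha + beta * sxx
    det = a * d - b * b
    cov = [[d / det, -b / det], [-b / det, a / det]]
    # Posterior mean: beta * cov * X^T y.
    mean = [
        beta * (cov[0][0] * sy + cov[0][1] * sxy),
        beta * (cov[1][0] * sy + cov[1][1] * sxy),
    ]
    return mean, cov

def predictive(x, mean, cov, beta=1.0):
    """Mean and variance of the predictive distribution at x."""
    mu = mean[0] + mean[1] * x
    var = 1 / beta + cov[0][0] + 2 * cov[0][1] * x + cov[1][1] * x * x
    return mu, var

# Made-up data lying on y = 1 + 2x.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
mean, cov = posterior(xs, ys)
print(mean, predictive(4, mean, cov))
```

With a very weak prior (small `alpha`), the posterior mean approaches the ordinary least-squares estimate, while the predictive variance grows for inputs far from the observed data.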

with one feature

Variance
Original Points  New Points
0.9661596        0.9287271

with more features

Variance
Original Points  New Points
0.9678819        0.9349488

Findings and Conclusion

Comparison between all the models

The table compares the R-squared of the linear and polynomial regressions, and the variance reported for the Bayesian linear regressions, under the original and new point systems.

Model                                          Original Points  New Points
Linear Regression                              0.9662199        0.9288745
Linear Regression with more features           0.9679754        0.9350976
Polynomial Regression                          0.9666189        0.9289899
Polynomial Regression with more features       0.8625396        0.8817937
Bayesian Linear Regression                     0.9661596        0.9287271
Bayesian Linear Regression with more features  0.9678819        0.9349488

Predict values for the 2023 season

Using the best-suited model (here, the linear regression with more features) and assuming that we are in the second half of the championship, the predicted final points are:

Driver Name      Predicted points
Max Verstappen   406
Sergio Pérez     251
Fernando Alonso  234
Lewis Hamilton   196
Carlos Sainz     136
George Russell   125
Charles Leclerc  123
Lance Stroll     72
Esteban Ocon     55
Lando Norris     45
Pierre Gasly     30
Alexander Albon  13
Nico Hülkenberg  11
Valtteri Bottas  9
Oscar Piastri    9
Guanyu Zhou      7
Kevin Magnussen  4
Yuki Tsunoda     4
Nyck de Vries    0
Logan Sargeant   0